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Abstract: Undirected graphical models encode in a graph G the dependency structure 

' of a random vector Y. In many applications, it is of interest to model Y given another 

, random vector X as input. We refer to the problem of estimating the graph G{x) of 

psj ' Y conditioned on X = a; as "graph-valued regression." In this paper, we propose a 

^ ■ semiparametric method for estimating G{x) that builds a tree on the X space just as 

^ I in CART (classification and regression trees), but at each leaf of the tree estimates 

^ • a graph. We cah the method "Graph-optimized CART," or Go-CART. We study the 

. theoretical properties of Go-CART using dyadic partitioning trees, establishing oracle 

I inequalities on risk minimization and tree partition consistency. We also demonstrate 

I""!' the application of Go-CART to a meteorological dataset, showing how graph- valued 

^1 . regression can provide a useful tool for analyzing complex data. 
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1. Introduction 

Let y be a p-dimensional random vector with distribution P. A common way to study the 
structure of P is to construct the undirected graph G = {V,E), where the vertex set V 
corresponds to the p components of the vector Y. The edge set E is a. subset of the pairs 
of vertices, where an edge between Yj and Y^ is absent if and only if Yj is conditionally 
independent of Yk given all the other variables. Suppose now that Y and X are both random 
vectors, and let P{- \ X) denote the conditional distribution of Y given X. In a typical 
regression problem, we are interested in the conditional mean = K{Y\X = x). But 
if Y is multivariate, we may also be interested in how the structure of P{- \ X) varies as a 
function of X. In particular, let G{x) be the undirected graph corresponding to P(- \ X = x). 
We refer to the problem of estimating G{x) as graph-valued regression. 

Let Q = {G{x) : x G X} be a set of graphs indexed by x G A", where X is the domain of 
X. Then Q induces a partition of X, denoted as Xi, . . . , Xm, where xi and X2 lie in the same 
partition element if and only if G{xi) = G{x2)- Graph-valued regression is thus the problem 
of estimating the partition and estimating the graph within each partition element. 

We present three different partition-based graph estimators; two that use global optimiza- 
tion, and one based on a greedy splitting procedure. One of the optimization based schemes 
uses penalized empirical risk minimization; the other uses held-out risk minimization. As 
we show, both methods enjoy strong theoretical properties under relatively weak assump- 
tions; in particular, we establish oracle inequalities on the excess risk of the estimators, and 
tree partition consistency (under stronger assumptions) in Section 4. While the optimization 
based estimates are attractive, they do not scale well computationally when the input di- 
mension is large. An alternative is to adapt the greedy algorithms of classical CART, as we 
describe in Section 3.3. In Section 5 we present experimental results on both synthetic data 
and a meteorological dataset, demonstrating how graph- valued regression can be an effective 
tool for analyzing high dimensional data with covariates. 

2. Graph- Valued Regression 

Let ...,?/„ be a random sample of vectors from P, where each i/i G W. We are interested in 
the case where p is large and, in fact, may diverge with n asymptotically. One way to estimate 
G from the sample is the graphical lasso or glasso (Banerjee, Ghaoui and d'Aspremont, 2008; 
Friedman, Hastie and Tibshirani, 2007; Yuan and Lin, 2007), where one assumes that P is 
Gaussian with mean /i and covariance matrix S. Missing edges in the graph correspond to 
zero elements in the precision matrix Vt = (Edwards, 1995; Lauritzen, 1996; Whittaker, 
1990). A sparse estimate of Q is obtained by solving 

n = argmin{tr(^fi) - log + X\\n\\i} (2.1) 

where Q is positive definite, S is the sample covariance matrix, and = Xljfcl^ifcl 

the elementwise £i-norm of Q. Friedman, Hastie and Tibshirani (2007) develop a efficient 
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algorithm for finding Q tliat involves estimating a single row (and column) of Q in each 



iteration by solving a lasso regression. The theoretical properties of f2 have been studied 
by Rothman et al. (2008) and Ravikumar et al. (2009). In practice, it seems that the glasso 
yields reasonable graph estimators even if Y is not Gaussian; however, proving conditions 
under which this happens is an open problem. 

We briefly mention three different strategies for estimating G'(x), the graph of Y condi- 
tioned on X = X, each of which builds upon the glasso. 

Parametric Estimators. Assume that Z = {X, Y) is jointly multivariate Gaussian with 
covariance matrix 



We can estimate S^, Sy, and S^y by their corresponding sample quantities Ex, Sy, and 
Sxy, and the marginal precision matrix of X, denoted as Qx, can be estimated using the 
glasso. The conditional distribution of Y given X = x is obtained by standard Gaussian for- 
mulas. In particular, the conditional covariance matrix of F | X is T,y\x = Sy — J^yx^x^xy 
and a sparse estimate of flY\x can be obtained by directly plugging Sy|x into glasso. How- 
ever, the estimated graph does not vary with different values of X. 

Kernel Smoothing Estimators. We assume that Y given X is Gaussian, but without mak- 
ing any assumption about the marginal distribution of X. Thus Y \ X = x ^ N{fi{x), S(a:)). 
Under the assumption that both fi{x) and S(x) are smooth functions of x, we estimate S(x) 
via kernel smoothing: 



where is a kernel (e.g. the probability density function of the standard Gaussian distribu- 
tion), II ■ II is the Euclidean norm, /i > is a bandwidth and 



Now we apply glasso in (2.1) with S = S(x) to obtain an estimate of G{x). This method 
is appealing because it is simple and very similar to nonparametric regression smoothing; 
the method was analyzed for one-dimensional X by Zhou, Lafferty and Wasserman (2010). 
However, while it is easy to estimate G{x) at any given x, it requires global smoothness of 
the mean and covariance functions. It is also computationally challenging to reconstruct the 
partition Xi, . . . , X^. 

Partition Estimators. In this approach, we partition X into finitely many connected re- 
gions Xi, . . . , Xm. Within each Xj, we apply the glasso to get an estimated graph Gj. We 
then take G{x) = Gj for all x G Xj. To find the partition, we appeal to the idea used in 
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CART (classification and regression trees) (Breiman et al, 1984). We take the partition el- 
ements to be recursively defined hyperrectangles. As is well-known, we can then represent 
the partition by a tree, where each leaf node corresponds to a single partition element. In 
CART, the leaves are associated with the means within each partition element; while in our 
case, there will be an estimated undirected graph for each leaf node. We refer to this method 
as Graph-optimized CART, or Go-CART. The remainder of this paper is devoted to the 
details of this method. 

3. Graph-Optimized CART 

Let X G M'^ and F G be two random vectors, and let {(xi, yi), . . . , ?/„)} be n i.i.d. sam- 
ples from the joint distribution of (X, Y). The domains of X and Y are denoted by X and 
y respectively; and for simplicity we take X = [0, l]"^. We assume that 

Y\X = x^ Np{fi{x),E{x)) 

where : M'' — )■ is a vector-valued mean function and S : M'' — )■ W^^'^ is a matrix-valued 
covariance function. We also assume that for each x, Q{x) = S(x)~^ is a sparse matrix, i.e., 
many elements of Q{x) are zero. In addition, Q{x) may also be a sparse function of x, i.e., 
Q{x) = ^1{xfi) ioT some R C {1, . . . ,d} with cardinality ^ d. The task of graph-valued 
regression is to find a sparse inverse covariance Q{x) to estimate fl{x) for any x G A"; in 
some situations the graph of Q{x) is of greater interest than the entries of Q{x) themselves. 

Go-CART is a partition-based conditional graph estimator. We partition X into finitely 
many connected regions Xi, . . . , X^, and within each Xj we apply the glasso to estimate a 
graph Gj. We then take G{x) = Gj for all x & Xj. To find the partition, we restrict ourselves 
to dyadic splits, as studied by Scott and Nowak (2006) and Blanchard et al. (2007). The 
primary reason for such a choice is the computational and theoretical tractability of dyadic 
partition-based estimators. 

3.1. Dyadic Partitioning Tree 

Let T denote the set of dyadic partitioning trees (DPTs) defined over X = [0, 1]'', where each 
DPT T G T is constructed by recursively dividing X by means of axis-orthogonal dyadic 
splits. Each node of a DPT corresponds to a hyperrectangle in [0, 1]'^. If a node is associated 
to the hyperrectangle A = Y['i=i[^i^My then after being dyadically split along dimension k, 
the two children are associated with the sub-hyperrectangles 

Kk l>k 

Given a DPT T, we denote by n(T) = {Xi, . . . , X^j.} the partition of X induced by the leaf 
nodes of T. For a dyadic integer N = 2^ where K G {0, 1, 2, . . .}, we define Tn to be the 
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collection of all DPTs such that no partition has a side length smaller than . Let /(■) 
denote the indicator function. We denote firi^) and Q^i^) as the piecewise constant mean 
and precision functions associated with T: 



my 



I^t{x) = ^ f-ix, ■ I (x e Xj) and Vlxix) = ^ fi^', ■ I (x E Xj) , 
j=i i=i 

where fix^ G and Qx^ ^ W^^ are the mean vector and precision matrix for Xj. 



3.2. Go-CART: Risk Minimization Estimator 

Before formally defining our graph-valued regression estimators, we require some further 
definitions. Given a DPT T with an induced partition n(T) = {Xj}^^^ and corresponding 
mean and precision functions ^t{x) and ^^^(x), the negative conditional log-likelihood risk 
R{T, ht.^t) and its sample version R{T, ht.^t) are defined as follows: 



rriT 



R{T,ijt,^t) = J2^ 



R{T,fXT,nT) 



n 



EE 

i=i j=i 



tr 



tr 



^X, {{Y - fix^W - f,;,^f) - log \Qx, I ) ■ / (X G X 



{{Vi - f^x, ) iVi - fJ'X, )^) - log ll^A-J ) - I {xi e Xj 



(3.1) 
.(3.2) 



Let [[T]] > denote a prefix code over all DPTs T E Tn satisfying XItstIv ^ "^^^ — ^'^^ 
such prefix code [[T]] is proposed in (Scott and Nowak, 2006), and takes the form 

[[T]] = 3|n(T)| - 1 + (|n(T)| - 1) logrf/log2. 

A simple upper bound for [[T]] is 

[[T]]<(3 + logd/log2)|n(T)|. (3.3) 

Our analysis will assume that the conditional means and precision matrices are bounded in 
the II ■ II oo and || ■ ||i norms; specifically we suppose there is a positive constant B and a 
sequence Li,„, . . . , Lmj.,n, where each Lj „ G M"^ is a function of the sample size n, and we 
define the domains of each fix and fix as 



Aj = |r2 G W^^"^ : f2 is positive definite, symmetric, and ||f^||i < Lj^n} ■ (3-4) 
With this notation in place, we can now define two estimators. 

Definition 3.1. The penalized empirical risk minimization Go-CART estimator is defined 
as 



{^x.^^x, = argmin^gT-^^^^^gjv/, n;,^,GA, \ RiT, /it, fix) + pen(r) 



where R is defined in (3.2) and 



; [[T]]log2 + 21ogM 

pen(T) = 7„ ■ mT\ . 

V n 

Empirically, we may always set the dyadic integer to be a reasonably large value; the 
regularization parameter 7^ is responsible for selecting a suitable DPT T G Tn- Once T 
is chosen, the tuning parameters Li,„, . . . , Lmj.,n corresponding each partition element of T 
need to be determined in a data- dependent way. We will discuss further details about this 
in the next section. 

We can also formulate an estimator by minimizing held-out risk. Practically, we split 
the data into two partitions; we use Pi = . . . , (x„^,y„J} for training and V2 = 

{{{x'l, y[), . . . , {x'^^, y'n^))} for validation with ni+n2 = n. The held-out negative log-likelihood 
risk is then given by 

-Rout (7", /iT, ^r) = 

_ n2 mT 

- E E{ H - l^^M - l^^f)] - log ■ / e X,)]. (3.5) 

2 i=l j=l 

Definition 3.2. For each DPT T define 

/2t, = argmin^^^gjv^^._f^^^g^^._R(T, /iy, ^t) (3.6) 

where R is defined in (3.2) but only evaluated on T>i = {{xi, ?/i), . . . , {xm, Um)}- The held-out 
risk minimization Go-CART estimator is 

f = aigminrp^j-^RontiT, /ir, ^t)- 

where i?out is defined in (3.5) but only evaluated on V2. 

3.3. Go-CART: Greedy Partitioning 

The above procedures require us to find an optimal dyadic partitioning tree within Tn- 
Although dynamic programming can be applied, as in (Blanchard et ai, 2007), the compu- 
tation does not scale to large input dimensions d. We now propose a simple yet effective 
greedy algorithm to find an approximate solution (T, /ly, fir)- We focus on the held-out risk 
minimization form as in Definition 3.2, due to its superior empirical performance. But note 
that our greedy approach is generic and can easily be adapted to the penalized empirical 
risk minimization form. 

First, consider the simple case that we are given a dyadic tree structure T which induces a 
partition Il{T)={Xi, . . . , X^j.} on X. For any partition element Xj, we estimate the sample 
mean using Vi. 

^ ni 
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The glasso is then used to estimate a sparse precision matrix More precisely, let Tix^ be 
the sample covariance matrix for the partition element Xj, given by 

The estimator Qx^ is obtained by optimizing 

^Ixj = argmin{tr(S;t',r^) - log \^\ + Aj||i7||i}, 
n>-o 

where Xj is in one-to-one correspondence with Ljn in (3.4). In practice, we run the full 
regularization path of the glasso, from large \j, which yields very sparse graph, to small 
Xj, and select the graph that minimizes the held-out negative log-likelihood risk. To further 
improve the model selection performance, we refit the parameters of the precision matrix 
after the graph has been selected. That is, to reduce the bias of the glasso, we first esti- 
mate the sparse precision matrix using ^i-regularization, and then we refit the Gaussian 
model without £i-regularization, but enforcing the sparsity pattern obtained in the first 
step. Liu, Lafferty and Wasserman (2010) demonstrate that such a refitting step will yield a 
significantly better model selection performance when estimating graphs. 

The natural, standard greedy procedure starts from the coarsest partition X = [0, 1]'^ and 
then computes the decrease in the held-out risk by dyadically splitting each hyp errect angle 
A along dimension k G {1,. . .d}. The dimension k* that results in the largest decrease in 
held-out risk is selected. More precisely, let slfc(^) be the side length of A on the dimension 
k. If slfc(^) > , where K = logg A^, we dyadically split A along the dimension k. In this 
case, let A^l^^ and be the two resulting sub-hyperrectangles. The decrease in held-out 
risk takes the form 

^R^out{-^yl''A,^A) = RontiA,JlA,^A) — -Rout (-4.^''\ , ^ ^{fe) ) " -Rout (-A^\ /i . (fc) , . (fc) ) . 

(3.7) 

Note that if splitting any dimension k of A leads to an increase in the risk, we set a Boolean 
variable S{A) = False which indicates that the partition element A should no longer be 
split and hence A should be a partition element of n(T). The greedy Go-CART, as presented 
in Algorithm 3.1, recursively applies the previous procedure to split each partition element 
until all the partition elements cannot be further split. Note that we also record the dyadic 
partition tree structure in the implementation. 

This greedy partitioning method parallels the classical algorithms for classification and 
regression trees that have been used in statistical learning for decades. However, the strength 
of the procedures given in Definitions 3.1 and 3.2 is that they lend themselves to a theo- 
retical analysis under relatively weak assumptions, as we show in the following section. The 
theoretical properties of greedy Go-CART are left to future work. 
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Algorithm 3.1 Greedy Go-CART using Dyadic Partitioning 



Input: training data {xi,yi}^li, held-out validation data {x[,y'j}^^^, and an integer K 
Start from X = [0, l]'^. Set the Boolean variable S{X) = True and estimate J2x, 
while exists a hyperrectangle A such that S{A) = True do 
for all dimensions k € {1, . . . d} do 
if slfc(^) > 2--f^+i then 

Calculate AR''J^^{A,flA,^A) 

according to (3.7) 

else 

Set Ai?iS {A JIa, f^^) = -oo 
Determine the best splitting dimension k* = argmax^gj^^ AR'^J^^{A,'fl_A, ^a) 
if ARl^2{A,JlA,nA)>0then 

(k*) (k*) 

Dyadically split A along dimension k*, yielding two hyperrectangles A^ and Aj^ . Esti- 
mate /i (fc*),r2 (fc*),/i (fc.),r2 (fc*) and set S{A^^ ■*) = S{A^n ^) = True. 

•^L -^L -^R '^R 

else 

Set «S'(^) = False and put A into the final partition set. 
Output: Partition n(T) = {Xj}^^^ and the corresponding DPT T with the estimated Jj-Xj- 



4. Theoretical Properties 

We define the oracle risk R* over T/v as 



R* = R(T*, fit, VIt) = inf R(T, fiT, ^t). 

Note that T*, fi^,, and ^2^^, might not be unique, since the finest partition always achieves the 
oracle risk. To obtain oracle inequalities, we make the following two technical assumptions. 

Assumption 4.1. Let T G T/v be an arbitrary DPT which induces a partition n(T) = 
{Xi, . . . , Xmj.} on X, we assume that there exists a constant B, such that 

max ll/i;^ lloo ^ B and max sup log 1^2 1 < L„ 

l<j<mT ^ l<j<mT HgAj 

where Aj is defined in (3.4) and L„ = maxi<j<mT Lj^n, where Lj^n is the same as in (3.4). We 
also assume that 

L„ = o{^/n). 

Assumption 4.2. Let Y = {Yi, . . . , Yp)'^ e W. For any ^ C A", we define 

ZkeiA) = YkYe-I{X eA)-E{YkYe-I{X eA)) 
Zj{A) = Yj-I{X eA)-E{Yj-I{X eA)). 

We assume there exist constants Mi,M2,wi, and V2, such that 

supE\Zke{A)r < 1 and supE|Zj(^)|™ < ]- 

k,£,A 2 j^A 2 
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for all m > 2. 

Theorem 4.1. Let T ^ Tn be a DPT that induces a partition n(T) = {Xi, . . . , X^j.} on X . 
For any 6 G (0, 1), let T^Jlf^Qf be the estimator obtained using the penalized empirical risk 
minimization Go-CART in Definition 3.1, with a penalty term pen(T) of the form 



(r ^^^T /[[T]] log2 + 21ogp + log(48/(5) 
pen(T) = (Ci + l)L„mTi ' 



n 

?2 



where Ci = + 8B^yvl + B . Then for sufficiently large n, the excess risk inequality 
R(f, u^, Q^) -R* < inf \ 2pen(T) + inf (R(T, ut, Qt) - R*) 

holds with probability at least 1 — S. 

A similar oracle inequality holds when using the held-out risk minimization Go-CART. 

Theorem 4.2. Let T G Tn be a DPT which induces a partition n(T) = {Xi, . . . , Xmj.} on 
X. For any 6 G (0, 1), we define 4>n{T) to be a function of n and T: 



^ m (r M f^\T / [m]log2 + 21ogp + log(384/^) 

where C2 = 8y/2v^ + 8By/2vl + y/2B^ and Ln = maxi<j<rnT Lj,n- Partition the data into 
T^i = {(^i^^i)' • • • ' (^^ni'^ni)} (^'^d V2 = {{x[,y[), . . . , {x'^^^y'n^)} With sizcs m = n2 = n/2. 
Let T,fif,Qf be the estimator constructed using the held-out risk minimization criterion of 
Definition 3.2. Then, for sufficiently large n, the excess risk inequality 

R{f, fif, Qf) -R* < inf \ 3<Pn{T) + inf {R{T, fir, ^t) - R*)\ + (f>n{f) 

holds with probability at least 1 — 5. 

Note that in contrast to the statement in Theorem 4.1, Theorem 4.2 results in a stochastic 
upper bound due to the extra (f)n{T) term, which depends on the complexity of the final 
estimate T. The proofs of both theorems are given in the appendix. 

We now temporarily make the strong assumption that the model is correct, so that Y 
given X is conditionally Gaussian, with a partition structure that is given by a dyadic tree. 
We show that with high probability, the true dyadic partition structure can be correctly 
recovered. 

Assumption 4.3. The true model is 

Y\X = x^ Np{fi*T,{x),nT'{x)) (4.1) 
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where T* e 7^ is a DPT with induced partition n(r*) = {X^}]"!^ and 

Under this assumption, clearly 

R{T*,n*T.,n*T,) = inf R{T,fiT,nT), 

where A^t is given by 

Mt = {fi{x) = Y^f^x, I{x G Xj), n{x) = Vtx^ I{x G Xj) : 
i=i i=i 

where /i;,^, G M,, fi;,^. G A,-, U{T) = {A'.jfJi}. 

Let Ti and T2 be two DPTs, if n(Ti) can be obtained by further split the hyperrectangles 
within n(T2), we say n(T2) C n(Ti). We then have the following definitions: 

Definition 4.1. A tree estimation procedure T is tree partition consistent in case 

P (n{T*) C n(f )) ^1 as n ^ 00. 

Note that the estimated partition may be finer than the true partition. Establishing a tree 
partition consistency result requires further technical assumptions. The following assumption 
specifies that for arbitrary adjacent subregions of the true dyadic partition, either the means 
or the variances should be sufficiently different. Without such an assumption, of course, it is 
impossible to detect the boundaries of the true partition. 

Assumption 4.4. Let X^ and Xj be adjacent partition elements of T*, so that they have 
a common parent node within T*. Let Zl*p = (f2*p)~^. We assume there exist positive 

i i 

constants Ci, C2, C3, C4, such that either 

- log |Elo| - log |Et.o| > C4 
or — /it-olli — ^3- ^Iso assume 

Pmin(fi^o) > Ci, Vj = 1, . . . , my., 

i 

where Pmin(') denotes the smallest eigenvalue. Furthermore, for any T & Tn and any A G 
n(r), we have P (X G ^) > C2. 



2 log 
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Theorem 4.3. Under the above assumptions, we have 



inf inf R(T, Ut, ^t) — R(T*, uX,,, Vt%.*) > minj^^^^— C2C4} 

TeTiv, n(T*)^n{T) /itMt&Mt 2 

where ci, C2, C3, C4 are defined in Assumption 4-4- Moreover, the Go-CART estimator in both 
the penalized risk minimization and held-out risk minimization form is tree partition consis- 
tent. 

This resuh shows that, with high probability, we obtain a finer partition than T*; the 
assumptions do not, however, control the size of the resulting partition. The proof of this 
result appears in the appendix. 



5. Experimental Results 

We evaluate the performance of the greedy Go-CART learning algorithm in Section 3.3 on 
both synthetic datasets and a meteorological dataset. In each experiment, we set the dyadic 
integer to = 2^° to ensure that we can obtain fine-tuned partitions of the input space 
X. Furthermore, we always ensure that the region (hyperrectangle) represented by each leaf 
node contains at least 10 data points to guarantee reasonable estimates of the sample means 
and sparse inverse covariance matrices. 



5.1. Synthetic Data 

We generate n data points Xi, . . . , a;„ G M'' with n = 10, 000 and d = 10 uniformly distributed 
on the unit hypercube [0, l]*^. We split the square [0, 1]^ defined by the first two dimensions 
into 22 subregions, as shown in Figure 1(a). For the t-th subregion where 1 < t < 22, we 
generate an Erdos-Renyi random graph G* = with p = 20 vertices and l^'l = 10 

edges, with maximum node degree four. As an illustration, the random graphs for subregion 
four (the smallest region), 17 (middle region) and 22 (large region) are presented in Figures 
1(b), (c) and (d), respectively. For each graph G*, we generate an inverse covariance matrix 
f2* according to: 

f 1 if z = J, 
^1, = I 0.245 if(2,j)eE*, 
I otherwise, 

where 0.245 guarantees positive-definiteness of f2* when the maximum node degree is four. 

To each data point Xj in the t-th subregion we associate a 20-dimensional response vector 
Ui generated from a multivariate Gaussian distribution A^2o (O, (f^*) ). We also create an 
equally-sized held-out dataset in the same manner based on {fi*}^^^ 

We apply Algorithm 3.1 to this synthetic dataset. The estimated dyadic tree structure and 
its induced partitions are presented in Figure 2. Estimated graphs for some nodes are also 
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(a) (b) (c) (d) 

Fig 1- (a) The 22 subregions defined on [0, 1]^. The horizontal axis corresponds to the first dimension 
denoted as Xi while the vertical axis corresponds to the second dimension denoted as X2. The bottom 
left point corresponds to [0, 0] and the upper right point corresponds to [1, 1]. (b) The true graph for 
subregion 4- (c) The true graph for subregion 17. (d) The true graph for subregion 22. 

illustrated. Note that the label for each subregion in subplot (c) is the leaf node ID of the tree 
in subplot (a). We conduct 100 Monte-Carlo simulations and find that in 82 out of 100 runs 
our algorithm perfectly recovers the ground truth partition of the X1-X2 plane, and never 
wrongly splits on any of the irrelevant dimensions, ranging from X^ to Xiq. Moreover, the 
estimated graphs have interesting patterns. Even though the graphs within each subregion 
are sparse, the estimated graph obtained by pooling all the data together is highly dense. 
As the algorithm progresses, the estimated graphs become more sparse. However, for the 
immediate parent nodes of the true subregions, the graphs become denser again. 

Out of the 82 simulations where we correctly identify the tree structure, we list the graph 
estimation performance for subregions 1, 4, 17, 18, 21, 22 in terms of precision, recall, and 
Fi-score. Let E be the estimated edge set while E be the true edge set. These criteria are 
defined as: 

\E f^ E\ \E f\ E\ precision ■ recall 

precision = — — , recall = — — — , Fi-score = 2 ■ -. (5.1) 

\E\ precision + recall 

We see that for a larger subregion, it is easier to obtain better recovery performance, while 
good recovery for a very small region is more challenging. Of course, in the smaller regions 
there is less data. In Figure 1(a), there are only 10000/64 ^ 156 data points that appear in 
subregion 1 (the smallest one). In contrast, approximately 10000/16 = 625 data points fall 
inside subregion 18, so that the graph corresponding to this region can be better estimated. 

We also plot the held-out risk in the subplot (c). As can be seen, the first few splits lead 
to the most significant decrease in the held-out risk. 

Further simulations where the ground truth covariance matrix is a continuous function 
of X are presented in the appendix. 
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Fig 2. (a) The estimated dyadic tree structure; (b) the induced partition on [0,1]^ and the number labeled 
on each subregion corresponds to each leaf node ID of the tree in (a); (c) the held-out negative log-likelihood 
risk for each split. The order of the splits corresponds the ID of the tree node (from small to large) 



Table 1 

Graph estimation performance over different subregions 





Mean values over 100 


runs (Standard deviation) 






subregion 


region 1 


region 4 


region 17 region 18 


region 21 


region 22 


Precision 
Recall 
Fi — score 


0.8327 (0.15) 
0.7890 (0.16) 
0.7880 (0.11) 


0.8429 (0.15) 
0.7990 (0.18) 
0.7923 (0.12) 


0.9821 (0.05) 0.9853 (0.04) 
1.0000 (0.00) 1.0000 (0.00) 
0.9904 (0.03) 0.9921 (0.02) 


0.9906 (0.04) 
1.0000 (0.00) 
0.9949 (0.02) 


0.9899 (0.05) 
1.0000 (0.00) 
0.9913 (0.02) 



5.2. Climate Data Analysis 

In this section, we use graph- valued regression to analyze a meteorology dataset (Lozano et ai, 
2009) that contains monthly data of 18 different meteorological factors from 1990 to 2002. 
We use the data from 1990 to 1995 as the training data and the data from 1996 to 2002 as the 
held-out validation data. The observations span 125 locations in the US on an equally spaced 
grid between latitude 30.475 and 47.975 and longitude -119.75 to -82.25. The 18 meteorolog- 
ical factors measured for each month include levels of CO2, CH4, H2, CO, average temperature 
(TMP) and diurnal temperature range (DTR), minimum temperate (TMN), maximum tem- 
perature (TMX), precipitation (PRE), vapor (VAP), cloud cover (CLD), wet days (WET), frost 
days (FRS), global solar radiation (GLO), direct solar radiation (DIR), extraterrestrial radia- 
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Fig 3. Analysis of the climate data, (a) Estimated partitions for 125 locations projected to the US 
map, with the estimated graphs for suhregions 2, 3, and 65; (h) estimated graph with data pooled 
from all 125 locations; (c) the re-scaled partition pattern induced by the dyadic tree structure. 



tion (ETR), extraterrestrial normal radiation (ETRN) and UV aerosol index (UV). For further 
detail, see Lozano et al. (2009). 

As a baseline, we estimate a sparse graph on the data from all 125 locations, using the 
glasso algorithm; the estimated graph is shown in Figure 3 (b). It is seen that there is no 
edge connecting to any of the greenhouse gas factors CO2, CH4, H2 or CO. This apparently 
contradicts basic domain knowledge that these four factors should correlate with the solar 
radiation factors (including GLO, DIR, ETR, ETRN, and UV), according to the 2007 report 
of the Intergovermental Panel on Climate Change IPCC (2007). The reason for the missing 
edges in the pooled data may be that positive correlations at one location are canceled by 
negative correlations at other locations. 

Treating the longitude and latitude of each site as two-dimensional covariate X, and 
the meteorology data of the p = 18 factors as the response y, we estimate a dyadic tree 
structure using the greedy algorithm. The result is a partition with 87 subregions, shown 
in Figure 3, with the corresponding dyadic partition tree is shown in Figure 4. The graphs 
for subregion 2 (corresponding to the strip of land from Los Angeles, California to Phoenix, 
Arizona) and subregion 3 (Bakersfield, California to Flagstaff, Arizona) are shown in subplot 
(a) of Figure 3. The graphs for these two adjacent subregions are quite similar, suggesting 
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Fig 4. The estimated dyadic tree structure on the climate data. 



spatial smoothness of the learned graphs. Moreover, for both graphs, CO is connected to 
solar radiation factors in either a direct or indirect way, and H2 is connected to UV, which 
is accordance with Chapter 7 of the IPCC report IPCC (2007). In contrast, for subregion 
65, which corresponds to the border of South Dakota and Nebraska; here the graph is quite 
different. In general, it is found that the graphs corresponding to the locations along the 
coasts are sparser than those corresponding to more central locations in the mainland. 

Such observations, which require validation and interpretation by domain experts, are 
examples of the capability of graph- valued regression to provide a useful tool for high di- 
mensional data analysis. 

6. Conclusions 

In this paper, we present Go-CART, a partition-based estimator of the family of undirected 
graphs associated with a high dimensional conditional distribution. Dyadic partitioning es- 
timators, either using penalized empirical risk minimization or data splitting, are attractive 
due to their simplicity and theoretical guarantees. We derive finite sample oracle inequali- 
ties on excess risk, together with a tree partition consistency result. Our theory allows the 
scale of the graphs to increase with the sample size, which is relevant since the methods 
are targeted at high dimensional data analysis applications. Greedy partitioning estimators 
are proposed that are computationally attractive, combining classical greedy algorithms for 
decision trees with recent advances in £i-regularization techniques for graph selection. The 
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practical potential of Go-CART is indicated by experiments on a meteorology dataset. A 
theoretical analysis of greedy Go-CART is one of several interesting directions for future 
work. 



Appendix A: Proofs of Technical Results 
A.l. Proof of Theorem 4-i 

For any T eTn, denote 

1 " 



1=1 



(A.l) 
(A.2) 



We then have 



R{T,flT,^T)-RiT,fiT,^T) 

TTl lit ^ lb 



< 



i=l 



A, 



I oo 



n 



1=1 



(A.3) 
(A.4) 



We now analyze the terms Ai and A2 separately. 

For A2, using the Hoeff ding's inequality, for e > 0, we get 



P 

which implies that, 

P 



n \ 

- Y ^(^« ^ ^i) - ^^(^ ^ '^i) > e < 2 exp (-2ne2) 



(A.5) 



( sup - Y^i^i ^ ^i) - ^^(^ ^ ^j) /er > 1 I < 2 ^ exp (-2ne|) 



(A.6) 



where means e is a function of T. For any 5 G (0, 1), we have, with probability at least 
1 - 5/4, 



1 " 

- ^/(x, G A'j) - E/(X G 
where [[T]] > is the prefix code of T given in (3.3). 



< 



[[T]]log2 + log(8/5) 
2n 



(A.7) 
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From Assumption 4.1, since ^j, we have that 



max log \ QxA < 

l<j<mT 



Therefore, with probabihty at least 1 — 5/4, 



A2 < LnTTlT 

Next, we analyze the term Ai. Since 



[[T]]log2 + log(8/,5) 
2n 



max II A', 111 < Lr, 

l<j<mT 



(Ai 



(A.9) 



(A.IO) 



we only need to bound the term \\Sj^n — Sj\\ao- By Assumption 4.2 and the union bound, we 
have, for any e > 0, 



^iP,,n-S,\\^>e) 
f\\ 1 " 

/ii 1 " 

/|l 1 " 

VII i=i 



e 

00 4 



00 ^ 4 



00 ^ 4 



e 

00 4 



(A.ll) 
(A.12) 
(A.13) 
(A.14) 



Using the fact that jl/ijloo ^ and the Assumption 4.2, we can apply Bernstein's exponential 
inequality on (A.ll), (A.12), and (A.13). Also, since the indicator function is bounded, we 
can apply Hoeff ding's inequality on (A.14). In this way we obtain 



^i\K.-S,\\^>e) 



< 2p exp 



1 / ne 



32 V ^'2 + 



+ 4p exp 



1 



ne 



3252 \vi + Mie 



2p exp 



(A.15) 



Therefore, for any 5 G (0, 1), we have, for any e — t- as n goes to infinity, with probability 
at least 1 — 5/4 



VT G Ttv? Pj.n — 'S'JI _ < 



. / [[T]]log2 + 21ogp + log(24/5) ^^^^g^ 



n 



+ (8B^)../MMi±li2i^±M!M ,A.17) 



n 



^2 ,;[[T]]log2 + 21ogp + log(24/5) 



2n 



(A.18) 
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Combined with (A. 10), we get that 



where Ci = + 85^^ + 5^. 

Since the above analysis holds uniformly over Tn, when choosing 



m rr^^iu ; [[T]]log2 + 21ogp + log(48A) 

pen(T) = (Ci + l)L„mrV , (A.20) 

V n 

we then get, with probability at least 1 — 5/2, 



sup 

Te7lv,MjeAfj,r2jGAj 

for large enough n. 

Given a DPT T, we define 



R{T,fiT,^T) - R{T,fiT,^T) <pen(T) (A.2i; 



/i°,fi°= argmin R{T,iit,^t)- (A.22) 

From the uniform deviation inequality in (A. 21), we have, for large enough n: for any S G 
(0, 1), with probability at least 1 — 5, 

R{f,'j2f,Qf) < ^(f,/2^,%) +pen(f) (A.23) 



inf iR(T,UTMT)+pe^iT)> (A.24) 

< inf |^(T,/i°,f]°) +pen(T)| (A.25) 

< inf {i?(T,/i?.,l]5.) + 2pen(T)} (A.26) 

= inf J inf (R(T,fXT,^T) + 2pen(T)\ . (A.27) 
The desired result of the theorem follows by subtracting R* from both sides. 
A. 2. Proof of Theorem 4.2 

From (A. 21) we have, for large enough n, on the dataset Vi, with probability at least 1 — 5/4 



sup 



RiT,ixT,nT)-R{T,f,T,nT) <MT). (A.28) 



Following the same line of analysis, we can also get that on the validation dataset P2, with 
probability at least 1 — 5/4, 



sup 

TeTN 



R{T,ltTMT) - RontiT,JlT,^T) < MT) (A.29) 
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for large enough n. Here /xt, defined in (3.6). 

Given a DPT T, we define 

/i°,l]°= argmin R{T,fiT,^T)- (A.30) 

Using the fact that 

f = argmin^g7-^^out(r, /it, ^t), (A.31) 
we have, for large enough n and any 6 G (0, 1), with probability at least 1 — 5, 

R{f,Jlf,hf) < ^out(T,/2^,%) +0„(f) (A.32) 

= M Ront{T,-[lTM + MT) (A.33) 

TeTN 

< M \r{T,JIt,(It) + MT)]+MT) (A.34) 

< inf {r{T,i2tM + MT) + MT)}+MT) (A.35) 

< inf {i?(T,/i°,fiO) + 0„(T) + 0„(T)|+0„(f) (A.36) 



< inf <^ 30„(T) + inf i?(T, /x^, ^t) } + 0n(T). 
The result follows by subtracting R* from both sides. 
A. 3. Proof of Theorem 4.3 

For any T G Tn, n(T*) ^ n(T), there must exist a subregion X' G n(T) such that no 
A G n(T*) satisfies X' C We can thus find a minimal class of disjoint subregions 
{A'O, . . . , } G n(T*), such that 

A" C utiXj", (A.37) 

where A;' > 2. We define A"-* = X° fl A" for i = 1, . . . , k'. Then we have 

A" = utiX*. (A.38) 

Let {/it*, fit.jfli be the true parameters on X^, . . . ,Xu. We denote by R{X', fL^*,Q^,) 
the risk of /i^* and on the subregion X , so that 

k' r- 

R{x',fi*T.,Q*^,) = (tr [i]:^^((F-/i:^o(>'-/":^;r)] -iogi^^:^;i) -/(xg a;* 

fc' 

= pP(X G A") - J]P(^ G A';)log|fi;^.|. (A.39) 

i=i 
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Since the DPT T does not further partition X', we have, for any fi^, ^ -^r 
k' p 

= {tT[nT{{Y-iiT){Y-iiTf)]-\og\nr\)-I{XeX*) 



k' 



(tr [^Ir {iY - f^rW - f^rf)]) ■ I{X e X;) 



P(X G X') loglf^rl 



Using the decomposition 



(A.40) 



we obtain 



r 

5^E (tr -/(xe a;* 



fe' 



Using the bound 

R{X',fXT,^T) > raax{R{X',fx*T,,nT),RiX',fXT,^T')}, 
we proceed by cases. 

Case 1: The /i's are different. We know that 

inf R{X',fiT,^T) - RiX',iJ,*j.,,n*T.) 



{AAl] 



(A.42) 



(A.43) 



> miR{X',iJT,^T*) - R{X',fiT*,n*T: 



k' 



= inf ^ P (X G X*) (/i^, - f^rfn*,, (/i^. - /i^) 

fix ' 3 3 3 

k' 

> ciC2inf ll/^;^, - yUrlls 

Mt ' 3 



where the last inequahty follows from that fact that Pmin(^>'*) > ci,P (X G A"*) > C2. It's 

3 ■' 

easy to see that a lower bound of the last term is achieved at fiT, 
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(A.44) 
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Furthermore, for any two DPTs T and T', if n(T) C n(T') it's clear that 

inf R{T,fiT,^T)> inf R(T' , ^t' , ^t') ■ 



(A.45) 



Therefore, in the sequel, without loss of generality we only need to consider the case k' = 2. 
The result in this case then follows from the fact that 



(A.46) 



Case 2: The f2's are different. In this case, we have 



inf R{X',fiT,^T) - R{X',fi*j„,n*j,.) >MR{X',fi*T.,nT)-R{X',fi*T.,n*T.) 
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(A.48) 
(A.49) 
(A.50) 



X* 



where St = '^'^ . 

As before, we only need to consider the case /c' = 2. A lower bound of the last term is 
achieved at 

S V* + S V. 



(A.51) 



Plugging in St, we get 
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where the last inequahty follows from the given assumption. 
Therefore, we have 



inf R{X', Ht, ^t) — R{X', /i^* , fi^* ) > C2C4. 



(A.56) 



The theorem is obtained by combining the two cases. 
Appendix B: Further Simulations 

To further demonstrate the performance of the method, this section presents simulations 
where the true conditional covariance matrix is continuous in X. We compare the graphs 
estimated by our method to the single graph obtained by applying the glasso directly to the 
entire dataset. 

In this subsection, we consider the case where X lies on a one dimensional chain. More 
precisely, we generate n equally spaced points Xi, . . . , a;„ G M with n = 10, 000 on [0, 1]. We 
generate an Erdos-Renyi random graph = {V^, E^) with p = 20 vertices, \E\ = 10 edges, 
and maximum node degree four. Then, we simulate the output . . . , G as follows: 

1. For t = 2 to T, we construct the graph G* = (K*, E^) as follows: (a) with probability 
0.05, remove one edge from and (b) with probability 0.05, add one edge to the 
graph generated in (a). We make sure that the total number of edges is between 5 and 
15, and that the maximum node degree four. 

2. For each graph G*, generate the inverse covariance matrix fi*: 



where 0.245 guarantees positive-definiteness of fi* under the degree constraint. 
3. For each t, we sample yt from a multivariate Gaussian distribution with mean fi = 
(0, . . . , 0) G and covariance matrix S* = 

We generate an equal-sized held-out dataset in the same manner, using the same fi and 
S*. Greedy Go-CART is used to estimate the dyadic tree structure and corresponding inverse 
covariance matrices; these are displayed in Figure 5. 

B.l. Chain Structure 

To examine the recovery quality of the underlying graph structure, we compare our esti- 
mated graphs to the graph estimated by directly applying the glasso to the entire dataset. 
Comparisons in terms of precision, recall and Fi-score are given in Figure 6 (a), (b) and 
(c) respectively. As we can see, the partition-based method achieves much higher precision 




1 if i = j, 
0.245 if{i,j)EE\ 
otherwise. 
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(a) (b) 

Fig 5. (a) Estimated tree structure; (h) corresponding partitions 



and Fi-Score. As for recall, glasso is slightly better, due to the fact that the glasso graphs 
estimated on the entire data are very dense, as shown in 6 (d). The dense graphs lead to 
fewer false negatives (thus large recall) but many false positives (thus small precision). 

B.2. Two-way Grid Structure 

In this section, we apply Go-CART to a two dimensional design X. The underlying graph 
structures and Y are generated in manner similar to that used in the previous section. In 
particular, we generate equally spaced xi, . . . G with n = 10,000 on the unit two- 
dimensional grid [0, 1]^. We generate an Erdos-Renyi random graph G^'^ = (V^'^ , E^'^) with 
p = 20 vertices, \E\ = 10 edges, and maximum node degree four, then construct the graphs 
for each x along diagonals. More precisely, for each pair of where 1 < i < 100 and 
1 < j < 100, we randomly select either G^~^'^ (if it exists) or G^'^~^ (if it exists) with equal 
probability as the basis graph. Then, we construct the graph G*'-' = {V^'^,E'^'^) by removing 
one edge and adding one edge with probability 0.05 based on the selected basis graph, taking 
care that the number of edges is between 5 and 15 and the maximum degree is still four. 
Given the underlying graphs, we generate the covariance matrix and output Y in the same 
way as in the last section. 

We apply the greedy algorithm to learn the dyadic tree structure and corresponding 
inverse covariance matrices, shown in Figure 7. We plot the Fi-score obtained by glasso on 
the entire data compared against the our method in Figure 8. It is seen that for most x, the 
partitioning method achieves significantly higher Fi-score than directly applying the glasso. 
Note that since the graphs near the middle part of the diagonal (the line connecting [0, 1] 
and [1,0]) have the greatest variability, the Fi-scores for both methods are low in this region. 
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(a) (b) 

Fig 8. (a) Color map of Fi- score for glasso run on the entire dataset; (h) color map of Fi- score for Go- 
CART. Red indicates large values (approaching 1) and blue indicates small values (approaching 0), as shown 
in the color bar. 



24 



References 

Banerjee, O., Ghaoui, L. E. and d'Aspremont, A. (2008). Model selection through 
sparse maximum likelihood estimation. Journal of Machine Learning Research 9 485-516. 

Blanchard, G., Schafer, C, Rozenholc, Y. and Muller, K.-R. (2007). Optimal 
dyadic decision trees. Mach. Learn. 66 209-241. 

Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. (1984). Classification and 
regression trees. Wadsworth Publishing Co Inc. 

Edwards, D. (1995). Introduction to graphical modelling. Springer- Verlag Inc. 

Friedman, J. H., Hastie, T. and Tibshirani, R. (2007). Sparse inverse covariance esti- 
mation with the graphical lasso. Biostatistics 9 432-441. 

IPCC, (2007). Climate Change 2007-The Physical Science Basis IPCC Fourth Assessment 
Report. 

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press. 

Liu, H., Lafferty, J. and Wasserman, L. (2010). Tree Density Estimation. 

arXiv:1001.1557vl [stat.ML] 10 Jan 2010. 
LozANO, A. C, Li, H., Niculescu-Mizil, A., Liu, Y., Perlich, C, Hosking, J. and 

Abe, N. (2009). Spatial-temporal causal modeling for climate change attribution. In ACM 

SIGKDD. 

Ravikumar, p., Wainwright, M., Raskutti, G. and Yu, B. (2009). Model Selection 
in Gaussian Graphical Models: High-Dimensional Consistency of £i-regularized MLE. In 
Advances in Neural Information Processing Systems 22. MIT Press, Cambridge, MA. 

Rothman, a. J., BiCKEL, p. J., Levina, E. and Zhu, J. (2008). Sparse permutation 
invariant covariance estimation. Electronic Journal of Statistics 2 494-515. 

Scott, C. and Nowak, R. (2006). Minimax-optimal classification with dyadic decision 
trees. Information Theory, IEEE Transactions on 52 1335-1353. 

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley. 

Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical 
model. Biometrika 94 19-35. 

Zhou, S., Lafferty, J. and Wasserman, L. (2010). Time Varying Undirected Graphs. 
Machine Learning 78. 



25 



